Experiments in Data Compression III
(or "The Path Not Taken")
by John Kormylo
E.L.F. Software
State of the Art
The following table shows the results of compressing three
different files with three different methods. 'ZIP' refers to
STZIP 2.5 (deflating), 'LZA1' to the LZA method used by ELFBACK
v2.9, and 'LZA2' to the LZA method used by ELFARC v1.3.
       | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
ZIP    |        25134 |       32417 |       496876 |
LZA1   |        24860 |       34719 |       585185 |
LZA2   |        22029 |       31997 |       453143 |
-------+--------------+-------------+--------------+
Back to Basics
Huffman and arithmetic coding are fixed-length to variable-length
codes. They take advantage of the fact that some symbols occur
more often than others, and give them shorter codes than the less
common symbols. LZW is a variable-length to fixed-length code,
which attempts to build a dictionary of very long sequences that
all have the same probability of occurrence. Whenever a sequence
is used, the algorithm attempts to make it longer.
Previous experience indicates that a variable-length to
variable-length code is best. Certain sequences occur relatively
often, but not necessarily together. For text this should be
obvious: words are used many times, but usually in different
combinations. Therefore making a sequence longer is often not
nearly as useful as reducing the number of bits needed to code it.
LZA codes sequences using a length code (based on the relative
occurrence of matches of that length) followed by an
alphabetical-order code (based on how many matches of that length
were found in the dictionary). Whether to use one long match or
two short matches was determined by calculating the resulting
number of bits for all possible combinations. One would be hard
pressed to devise a more effective coding system.
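As a rough illustration (none of this is taken from the actual
ELFBACK or ELFARC source), the C sketch below compares the cost of
one long match against two short matches covering the same span.
The function names and the statistics in main() are invented; the
real coder would take its costs directly from the arithmetic
coder's counts.

  /* Sketch of the "one long match vs. two short matches" decision.
   * bits_for_len() and bits_for_order() are hypothetical stand-ins
   * for the real cost estimates (-log2 of a symbol's probability). */
  #include <stdio.h>
  #include <math.h>

  /* cost of the length code, from made-up match-length statistics */
  static double bits_for_len(int len, const int len_count[], int total)
  {
      return -log2((double)len_count[len] / (double)total);
  }

  /* cost of the alphabetical-order code: a flat choice among the
   * matches of that length found in the dictionary */
  static double bits_for_order(int nmatches)
  {
      return log2((double)nmatches);
  }

  static double match_cost(int len, int nmatches,
                           const int len_count[], int total)
  {
      return bits_for_len(len, len_count, total) + bits_for_order(nmatches);
  }

  int main(void)
  {
      int len_count[65] = {0};          /* matches coded, by length   */
      int total = 1000;                 /* total matches coded so far */
      len_count[4] = 400;               /* invented numbers           */
      len_count[8] = 50;

      double one_long  = match_cost(8, 2, len_count, total);
      double two_short = 2.0 * match_cost(4, 32, len_count, total);

      printf("one length-8 match  : %.1f bits\n", one_long);
      printf("two length-4 matches: %.1f bits\n", two_short);
      printf("choose: %s\n", one_long <= two_short ? "long" : "two short");
      return 0;
  }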
The problem with LZA is that it is slow. In part this is
due to the binary tree search, which must perform many string
comparisons to locate the first and last matches corresponding to
a specified length. This is also due to the large dictionary of
possible sequences, from which the best combination of sequences
must be determined. There are 16K text pointers and 64 possible
lengths, for a maximum of 1 Meg possible strings in the
dictionary (only about half that number are unique).
A smaller dictionary would require less time to search and
might make the search for the best coding sequence unnecessary.
Hash Tables
Hash tables are faster than binary search trees. Also, each
entry in the table corresponds to a string of a particular length,
and is the only entry corresponding to that string. Using the
index of the dictionary entry as the code gives a variable-length
to fixed-length code (such as LZW).
For a variable-length to variable-length code, one would also
need to keep a count of how many times each string was used.
Because of the 16-bit limit of the arithmetic coder, the total of
the counts must stay below 64K. Needless to say, the total number
of strings in such a dictionary would have to be much less than
1 Meg, probably less than 32K.
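To make the idea concrete, here is a minimal C sketch of what such
a dictionary entry and its lookup might look like. The layout, the
hash function and all names are my own invention, not ELFARC's; the
point is simply that each entry carries its string and a usage
count for the arithmetic coder.

  #include <stdint.h>
  #include <string.h>

  #define DICT_SIZE  8192           /* 8K entries, as in the table below */
  #define MAX_COUNTS 65536UL        /* 16-bit arithmetic coder limit     */

  typedef struct {
      const uint8_t *text;          /* where the string lives in the data */
      uint16_t       len;           /* length of the dictionary string    */
      uint16_t       count;         /* how often this string was coded    */
  } DictEntry;

  static DictEntry dict[DICT_SIZE];
  static uint32_t  total_counts;    /* must stay below MAX_COUNTS */

  /* illustrative FNV-1a hash over the string bytes */
  static unsigned hash(const uint8_t *s, unsigned len)
  {
      uint32_t h = 2166136261UL;
      while (len--) { h ^= *s++; h *= 16777619UL; }
      return h & (DICT_SIZE - 1);
  }

  /* find the entry for a string, or an empty slot to insert it into
   * (the sketch assumes the table never fills completely) */
  static DictEntry *lookup(const uint8_t *s, unsigned len)
  {
      unsigned i = hash(s, len);
      while (dict[i].text != NULL) {
          if (dict[i].len == len && memcmp(dict[i].text, s, len) == 0)
              return &dict[i];               /* found */
          i = (i + 1) & (DICT_SIZE - 1);     /* linear probing */
      }
      return &dict[i];                       /* empty slot */
  }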
       | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
LZW    |        32341 |       45147 |       905744 |
Count  |        31399 |       43143 |       771499 |
Prune  |        30574 |       39608 |       647192 |
Merge  |        28090 |       36325 |       606602 |
Smart  |        25974 |       35426 |       567752 |
-------+--------------+-------------+--------------+
The above table shows the results for several LZ78 variants.
The first uses LZW for all sequences and arithmetic for single
bytes. The second keeps track of the number of times each
sequence is used (variable to variable length coding). Both use
dictionaries with 8K entries.
One of the problems with LZW is that once a string has been
added to the dictionary, it can never be removed. Instead, the
entire dictionary is thrown out periodically and the whole
process begins again. However, it is possible to implement a
hash table where entries can be removed. Also, one can set the
count of an entry to 0 so that intermediate strings which are
'never' used will not affect the coding (although they will
continue to take up space in the hash table).
'Prune' in the above table reduces the count on every
entry whenever the table is full (or the total of the counts
reaches 64K). By doing all of the entries at once, we avoid the
problem of having to re-sort.
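A sketch of that pruning pass, continuing the entry layout shown
above (the article only says the counts are reduced; halving them
is one plausible choice):

  /* When the table is full, or total_counts approaches the 16-bit
   * limit, reduce every count in a single pass.  Entries whose
   * count reaches 0 still occupy their slot, but carry no weight in
   * the coder's model, so they no longer affect the coding. */
  static void prune_dictionary(void)
  {
      uint32_t new_total = 0;
      for (int i = 0; i < DICT_SIZE; i++) {
          if (dict[i].text == NULL)
              continue;
          dict[i].count >>= 1;          /* halve every count at once */
          new_total += dict[i].count;
      }
      total_counts = new_total;
  }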
LZ78 and LZW extend the strings in the dictionary by one
character each time they are used. This has two problems
associated with it. First, the strings grow very slowly.
Second, in the process of adding a long string which is repeated
often, one will also form lots of short strings from the
leftovers at all the intermediate stages.
A much more effective approach is to extend a string by every
character in the following string whenever two strings occur
consecutively. (Note that this means you cannot add strings to the
dictionary until they have been sent to the receiver.) The results
for this method are shown as 'Merge' in the above table.
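A sketch of that 'Merge' rule (again with invented names;
add_string() stands in for the hash-table insertion above):

  #include <stdint.h>

  /* descriptor of one string as it was coded (hypothetical) */
  typedef struct {
      const uint8_t *text;   /* where the coded string starts in the data */
      unsigned       len;    /* how many characters it covered            */
  } Coded;

  /* hypothetical: insert a new string into the dictionary */
  extern void add_string(const uint8_t *text, unsigned len);

  /* called after two strings in a row have been coded and sent */
  static void merge_strings(const Coded *prev, const Coded *cur)
  {
      /* only merge strings that sit back to back in the data, and
       * only once both have been sent, so the receiver can do the
       * same merge on its side */
      if (prev->text + prev->len == cur->text)
          add_string(prev->text, prev->len + cur->len);
  }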
It is not always best to choose the longest match available,
since the end may not line up with another string in the
dictionary. 'Smart' in the previous table shows what happens
when you choose the match length in order to code the most
characters using two consecutive matches. (Ties go to the match
with the highest count.) The best news is that this approach is
relatively fast.
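One way the 'Smart' length choice could be written, as a sketch
only (longest_match() and has_match() are hypothetical stand-ins
for the hash-table lookups; the real code presumably differs):

  #include <stdint.h>

  /* hypothetical: longest dictionary match at p, and its count */
  extern unsigned longest_match(const uint8_t *p, unsigned *count);
  /* hypothetical: is there a dictionary string of exactly len at p? */
  extern int      has_match(const uint8_t *p, unsigned len, unsigned *count);

  /* pick the match length that codes the most characters with two
   * consecutive matches; ties go to the match with the highest count */
  static unsigned smart_length(const uint8_t *p)
  {
      unsigned c1, best_len = 0, best_total = 0, best_count = 0;
      unsigned max_len = longest_match(p, &c1);

      for (unsigned len = 2; len <= max_len; len++) {
          unsigned count, dummy;
          if (!has_match(p, len, &count))
              continue;
          unsigned total = len + longest_match(p + len, &dummy);
          if (total > best_total ||
              (total == best_total && count > best_count)) {
              best_total = total;
              best_len   = len;
              best_count = count;
          }
      }
      return best_len;    /* 0 means no match: code a literal instead */
  }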
LZ79 (combining LZ77 & LZ78)
It was observed that tweaking the algorithm produced very
little change in the performance. Also, the results were similar
to the 'backwards matching' algorithm discussed in the previous
chapter. In other words, this is about as much as you can do
with dictionary compression. The next logical thing to try is
adding a 'FIFO buffer.'
A 64-byte sliding window was added to the match search.
Lengths 2-64 were coded as 257-319 in the arithmetic character
set (256 is used for the hash table). Offsets were sent without
weights. 'Long' shows the results using the longer of the two
matches (with ties going to the hash table) and no length
optimization (compares to 'Merge' rather than 'Smart'). This
achieved a minor improvement in the first two files, and a major
improvement in WORDLIST.
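The resulting symbol alphabet, as described above, could be laid
out like this (the enum names are mine; the symbol numbers follow
the text):

  /* 0-255: literal bytes; 256: hash-table (dictionary) match;
   * 257-319: sliding-window matches of length 2-64.  The 6-bit
   * window offset is then sent flat, i.e. "without weights". */
  enum {
      SYM_LITERAL_BASE = 0,
      SYM_DICT_MATCH   = 256,
      SYM_WINDOW_BASE  = 257,
      SYM_COUNT        = 320
  };

  /* map a window match length (2..64) onto its coder symbol */
  static int window_symbol(int length)
  {
      return SYM_WINDOW_BASE + (length - 2);   /* 2 -> 257, 64 -> 319 */
  }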
       | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
Long   |        28086 |       35397 |       522226 |
Ties   |        28829 |       35880 |       493510 |
Best   |        25785 |       34833 |       448995 |
Tree   |        25498 |       34410 |       465510 |
Lags   |        25397 |       34330 |       415494 |
16K    |        24489 |       34218 |       415349 |
-------+--------------+-------------+--------------+
'Ties' shows the results of the above method when ties go to
the sliding window instead of the hash table. This demonstrates
the need for a more intelligent method of choosing which match to
use, since the results are highly data dependent.
'Best' shows the results of using a Viterbi-like algorithm
to determine the best match and length (the same one used in
'LZA2'). This method looks at every possible way to compress the
next 64 bytes, at least until all paths have the same first step.
Of course, this approach is very slow even with the hash table.
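The idea can be sketched as a shortest-path search over the next
64 bytes, where every coding step (literal, window match or
dictionary match) is an edge weighted by its estimated bit cost.
The sketch below runs the full search rather than stopping as soon
as all paths agree on their first step, and step_cost() and
step_lengths() are invented stand-ins for the real cost model and
match finder:

  #include <float.h>

  #define LOOKAHEAD 64

  /* hypothetical: estimated bit cost of a step of 'len' bytes at p */
  extern double step_cost(int p, int len);
  /* hypothetical: fills lens[] with the usable step lengths at p
   * (1 = literal, >1 = matches); returns how many there are */
  extern int    step_lengths(int p, int lens[], int max);

  /* returns the length of the first step on the cheapest path that
   * codes the next LOOKAHEAD bytes */
  static int best_first_step(void)
  {
      double cost[LOOKAHEAD + 1];
      int    first[LOOKAHEAD + 1];   /* first step of best path to i */

      cost[0]  = 0.0;
      first[0] = 1;
      for (int i = 1; i <= LOOKAHEAD; i++) {
          cost[i]  = DBL_MAX;
          first[i] = 1;              /* default: a single literal */
      }

      for (int i = 0; i < LOOKAHEAD; i++) {
          if (cost[i] == DBL_MAX)
              continue;
          int lens[LOOKAHEAD];
          int n = step_lengths(i, lens, LOOKAHEAD - i);
          for (int k = 0; k < n; k++) {
              int j = i + lens[k];
              if (j > LOOKAHEAD)
                  continue;
              double c = cost[i] + step_cost(i, lens[k]);
              if (c < cost[j]) {
                  cost[j]  = c;
                  first[j] = (i == 0) ? lens[k] : first[i];
              }
          }
      }
      return first[LOOKAHEAD];
  }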
'Tree' uses a binary search tree together with the 'Best'
algorithm. Instead of looking at the last 64 characters, it
uses text pointers corresponding to the last 64 sequences coded.
'Lags' uses counts of how often a match was found based on
how far it was from the location matched (sort of like
autocorrelation data). As shown in the appendix of the previous
report, these counts vary from file to file.
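In other words, the coder keeps a table of match distances,
something like this (the array name and update are mine):

  #define WINDOW 64

  /* lag_count[d]: how often a match was found d bytes back */
  static unsigned lag_count[WINDOW + 1];

  /* called whenever a window match at distance d is coded; the
   * counts then weight the offset model, so common distances cost
   * fewer bits than rare ones */
  static void record_lag(int d)
  {
      lag_count[d]++;
  }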
Experiments with the buffer sizes found that the best results,
in general, came from a 64-entry FIFO buffer and a 16K-entry hash
table. These results are shown as '16K.'
At this point we are able to beat ZIP on two of the three
files. Unfortunately, nothing seems to help the third file much.
Increasing the size of the FIFO buffer helps a little. The
problem with this file is that it doesn't compress well at all,
and the matches found tend to be very short.
Exhaustion
Unfortunately, I was unable to find any way to achieve a
substantial improvement over the above results. However, the
methods developed during these experiments should be applicable
to improving the speed of the old LZA algorithm.